Batting Statistics and Salary in the MLB in 2015.

28 March 2022

Authors: Scott Masterson, Alex McMillen, Matthew Burke

Introduction

According to Business Insider, starting pitchers were among the highest earners in the MLB in recent years. These elite pitchers are reliable game-changers who bring consistency to any successful franchise. But what about the hitters? The guys who go out there and get that clutch walk-off home run -- the same guys who steal a crucial base in the bottom of the 8th. These hitters bring noticeable value at their positions on the field too, be it 3rd base, left field, or catcher, but what value can we measure with regard to their hitting stats? This raises the research question:

Which batting statistics result in a higher salary in major league baseball?

We are interested in this topic of research since several of us in our group grew up playing baseball. Nowadays, we're all student athletes at Emory University. Also, I was a big baseball card collector and I really liked looking for the players with higher stats. I remember how exciting it was back in 2012 when Miguel Cabrera won the Triple Crown by leading the league in batting average, home runs, and runs batted in.

We're curious to see how the three stats included in the Triple Crown (among others) affect how much a player earns in a given season. Since I would argue salary is also a function of how valuable the player is to a team, I would hypothesize that:

(The converse is relevant for each and the variables above may not be the variables examined later in the investigation)

External Data Motivations: key variables at a glance

Before jumping into the data set, it's important to understand which stats are sought-after in baseball for winning teams and successful players.

Batting Average (BA) and On Base Percentage (OBP)

The Bleacher Report article titled "On Base Percentage Vs. Batting Average" takes a deep dive into these two stats. The author weighs which stat better captures the value a player brings at the plate, and discusses how walks are often overlooked. He closes the argument by assessing a limitation of on base percentage: it counts walks the same as hits, even though hits are more likely to produce RBI, runs, or HR. Both variables are therefore key to understanding the value of different players throughout the major league.

Regarding this investigation, both stats can be used in the analysis; however, the data set we have selected does not include either stat, so each must be added to the data frame using the formulas below:

Batting average: $$ BA = \frac{\text{Hits}}{\text{At Bats}} $$

On base percentage: $$ OBP = \frac{\text{Hits} + \text{Walks} + \text{Hit by Pitch}}{\text{At Bats} + \text{Walks} + \text{Hit by Pitch} + \text{Sacrifice Flies}} $$
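The two formulas above can be sketched in pandas as follows. This is a minimal illustration rather than the notebook's original code: it assumes Lahman-style column names (H, AB, BB, HBP, SF), the toy numbers are made up, and the helper name `add_rate_stats` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy rows with Lahman-style column names; the numbers are invented for illustration.
batting = pd.DataFrame({
    "playerID": ["a", "b", "c"],
    "H": [150, 30, 0],
    "AB": [500, 120, 0],
    "BB": [60, 10, 0],
    "HBP": [5, 0, 0],
    "SF": [4, 1, 0],
})

def add_rate_stats(df):
    """Return a copy of df with BA and OBP columns added (hypothetical helper)."""
    out = df.copy()
    # A zero denominator becomes NaN so the rate is missing rather than infinite.
    at_bats = out["AB"].replace(0, np.nan)
    out["BA"] = out["H"] / at_bats
    plate_total = (out["AB"] + out["BB"] + out["HBP"] + out["SF"]).replace(0, np.nan)
    out["OBP"] = (out["H"] + out["BB"] + out["HBP"]) / plate_total
    return out

rates = add_rate_stats(batting)
print(rates[["playerID", "BA", "OBP"]].round(3))
```

Player "c" has no plate appearances, so both rates come out as missing instead of raising a division error.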

Homeruns

The article from the sabermetrics community Beyond the Box Score titled "Do more home runs mean more wins?" discusses how homeruns are an important factor in analyzing a team's win percentage. The article noted that teams with more homeruns often had a higher win percentage, as shown below in this simple linear regression:

[Figure: simple linear regression of team win percentage on home runs, from the Beyond the Box Score article]

The regression gave an R-Squared value of 0.1752 and a correlation coefficient of 0.418. For some teams, a higher number of homeruns didn't indicate a higher win percentage, and the article mentioned that this could be due to a variety of factors, such as a lack of good starting pitchers or too few runners on base.

For this investigation it may be valuable to see how homeruns affect salary -- it would make sense. Homeruns not only win games, but they're exciting and keep the crowd engaged. It's logical to assume that players who hit many homeruns would have a higher salary since they're directly contributing to the team's success and ticket sales.

The Data Set

We sourced the data set from seanlahman.com under the baseball archive (see sources). Originally we planned to use a smaller, outdated data set from Rutgers which was in a cleaner, more manageable form than this new data set; however, that data set was limited by its age and by its lack of variables to examine.

The data set we are using now contains a variety of interesting variables to be included in our analysis and more current statistics. The data set is a census of every baseball player in the MLB from 1871 to 2020, but we will be focusing on the year 2015 to ensure the data is cross-sectional. We selected this year using a random number generator from years 2000 to 2015. We didn't want to select years before 2000 to keep the data more current and we wanted to randomize our selection since we were indecisive about which year to select and it ultimately doesn't matter for the analysis we plan to perform.
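The random choice of season described above can be sketched with Python's standard library. The seed here is an arbitrary assumption added only so the example is repeatable; it is not the draw that produced 2015.

```python
import random

# Seed chosen arbitrarily (an assumption) so the example is repeatable.
rng = random.Random(2022)

# randint is inclusive on both ends, so every season from 2000 to 2015 is equally likely.
year = rng.randint(2000, 2015)
print(year)
```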

Creating, Cleaning, and Manipulating the Data Frames

The two CSV files will be filtered independently and then merged together; the comments within the code discuss this in greater detail.
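A minimal sketch of the filter-then-merge step is shown below. It is an illustration under stated assumptions, not the original notebook code: small in-memory frames stand in for the two CSV files, the column names follow the Lahman convention (playerID, yearID, AB, salary), and the at-bat cutoff of 25 is taken from the conclusion of this report.

```python
import pandas as pd

# Toy stand-ins for the batting and salary tables; in practice these would be
# loaded with pd.read_csv from the downloaded CSV files.
batting = pd.DataFrame({
    "playerID": ["a", "a", "b", "c", "d"],
    "yearID":   [2014, 2015, 2015, 2015, 2015],
    "G":        [150, 140, 12, 90, 60],
    "AB":       [550, 510, 20, 300, 180],
    "H":        [160, 150, 4, 80, 45],
})
salaries = pd.DataFrame({
    "playerID": ["a", "a", "b", "c"],
    "yearID":   [2014, 2015, 2015, 2015],
    "salary":   [9_000_000, 12_000_000, 507_500, 3_000_000],
})

YEAR = 2015
MIN_AB = 25  # assumed cutoff: keep players with more than 25 at bats

# Filter each table on its own before joining.
bat_year = batting[(batting["yearID"] == YEAR) & (batting["AB"] > MIN_AB)]
sal_year = salaries[salaries["yearID"] == YEAR]

# Inner join keeps only players present in both tables; validate raises an
# error if a player has more than one salary row for the season.
merged = bat_year.merge(
    sal_year, on=["playerID", "yearID"], how="inner", validate="many_to_one"
)
print(merged)
```

In this toy data, player "b" is dropped by the at-bat filter and player "d" is dropped because no salary is recorded, which is the behavior an inner join implies.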

Variable Investigation

In this section we will assess the strengths and weaknesses of each variable that may be relevant to our investigation. From there we will select variables to use in the regression under the methodology section.

At a glance, there are a variety of variables to focus on. It'll be difficult to tell which variables will have a noticeable effect on our dependent variable without testing (which will be done in a later section), but we can do some preliminary analysis using correlation. We've constructed a correlation matrix for all variables in the dataset and transformed it into a heatmap. The heatmap allows us to visualize which variables are most/least correlated with one another and with the variable of interest, salary.

Stint and yearID have been removed just for the purpose of display in this heatmap.
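The correlation matrix behind such a heatmap can be sketched as follows. The data are made up and the column set is a small assumed subset; the plotting call itself is left out, though the resulting matrix could be passed to a function such as `seaborn.heatmap`.

```python
import numpy as np
import pandas as pd

# Invented rows using a few Lahman-style columns.
df = pd.DataFrame({
    "yearID": [2015] * 6,
    "stint":  [1, 1, 1, 2, 1, 1],
    "G":      [150, 20, 90, 45, 120, 60],
    "H":      [170, 10, 85, 30, 140, 50],
    "SO":     [100, 15, 70, 40, 90, 55],
    "salary": [12_000_000, 600_000, 2_500_000, 510_000, 8_000_000, 1_200_000],
})

# yearID is constant within a single season, so it has zero variance and its
# correlation with anything is undefined; stint is bookkeeping rather than a stat.
corr = df.drop(columns=["yearID", "stint"]).corr()

# The salary column of the matrix ranks each stat by its correlation with salary.
salary_corr = corr["salary"].drop("salary").sort_values(ascending=False)
print(corr.round(2))
print(salary_corr.round(2))
```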

In the top left corner, the variables appear to be heavily correlated with one another. On the far right column (or bottom row) we can see how each variable is correlated with salary. It appears that most of the variables are only slightly positively correlated or even negatively correlated with salary. Negative correlation would make sense for variables like strikeouts or caught stealing, but I would have expected to see games or hits have more of a positive relationship with salary.

Variable Creation

Here we create batting average and on base percentage as variables in the data set. As mentioned in the external data motivation section, we believe the two stats are useful when expressing a player's value to a winning team and therefore may be indicators of wage. We can examine this further in the methodology section.

We realized the dataset we chose did not have a stat for the players' batting averages, so we created this variable and added it to the dataframe as another independent variable, which is shown in the output above. Also, a log_salary variable has been constructed using numpy for easier use later on. Using the logarithm of salary makes sense to account for outliers, and it leads to more appealing visualizations.
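A minimal sketch of the log transform follows. The salary figures are illustrative values chosen for this example, not rows from the data set.

```python
import numpy as np
import pandas as pd

# Illustrative salaries in US dollars (made-up values).
df = pd.DataFrame({"salary": [507_500, 3_000_000, 12_000_000, 30_000_000]})

# Natural log of salary; np.log works elementwise on a pandas column.
df["log_salary"] = np.log(df["salary"])

# The raw scale spans a large ratio, while the log scale spans only a few units.
spread_raw = df["salary"].max() / df["salary"].min()
spread_log = df["log_salary"].max() - df["log_salary"].min()
print(df)
print(f"max/min ratio: {spread_raw:.1f}, log range: {spread_log:.2f}")
```

Because the log turns ratios into differences, a handful of very large contracts no longer dominates the scale of a plot or a regression.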

Creating some Categorical Variables to be Used in Visualizations and Further Analysis

Using games (G), we slice the data into two categories: "Plays Infrequently" and "Plays Frequently." Plays Infrequently includes all batters who played 6 (min) to 86 (median) games, and Plays Frequently includes all batters who played 87 to 162 games.

The same approach is used below but with the at bats (AB) variable.
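The median split described above can be sketched as below. The helper name `median_split`, the toy values, and the "Bats Infrequently"/"Bats Frequently" labels for at bats are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

def median_split(values, low_label, high_label):
    """Label each value as at/below the median or above it (hypothetical helper)."""
    cutoff = values.median()
    labels = np.where(values <= cutoff, low_label, high_label)
    return pd.Series(labels, index=values.index), cutoff

# Made-up games and at-bat totals.
df = pd.DataFrame({
    "G":  [6, 40, 86, 87, 120, 162],
    "AB": [30, 95, 250, 260, 480, 640],
})

df["G_cat"], g_cutoff = median_split(df["G"], "Plays Infrequently", "Plays Frequently")
df["AB_cat"], ab_cutoff = median_split(df["AB"], "Bats Infrequently", "Bats Frequently")
print(df)
```

Players at exactly the median fall into the lower category, matching the "6 (min) to 86 (median)" description.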

Visualizations

A few graphs will help us in selecting which variables will be the most interesting to examine in our analysis. While some of the first few graphs don't help us directly visualize salary initially, seeing the distribution of individual explanatory variables is useful for further analysis.

The distribution of RBIs is skewed right and not symmetric. This is unsurprising, as the majority of players have fewer than 20 RBIs due to lack of playing time, while the few "star players" have as many as 100 RBIs or even more.

The distribution of batting average is skewed left and not symmetric. The batting average with the highest frequency is approximately 0.265 while the majority of batting averages fall somewhere between 0.2 and 0.3.

Similar to the distribution of batting average, the distribution for OBP is skewed left (though maybe slightly less) and not symmetric. As one would expect, the highest frequency for OBP is higher than that for batting average (average counts only hits while OBP also counts walks and hit by pitch), falling at approximately 0.325. The majority of observations fall between approximately 0.275 and 0.365, which is also higher than that for batting average.
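The direction of skew read off the histograms above can also be checked numerically. The sketch below uses made-up samples shaped like the distributions described, and relies on the sign convention of pandas' `Series.skew`: positive for a long right tail, negative for a long left tail.

```python
import pandas as pd

# Made-up samples: many low RBI totals with a few stars, and batting averages
# bunched near .265 with one very low value.
rbi = pd.Series([0, 2, 3, 5, 8, 10, 15, 20, 45, 110])
ba = pd.Series([0.10, 0.22, 0.24, 0.25, 0.26, 0.265, 0.27, 0.28, 0.29, 0.30])

rbi_skew = rbi.skew()  # expected to be positive (skewed right)
ba_skew = ba.skew()    # expected to be negative (skewed left)
print(f"RBI skewness: {rbi_skew:.2f}")
print(f"BA skewness:  {ba_skew:.2f}")
```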

Coloring by Salary: Interesting Findings in the Data

After reviewing several variables above separate from salary, more complex plots including our dependent variable can help motivate the methodology section that follows.

This graph shows that RBIs are strongly positively correlated with games. These findings are unsurprising for two reasons. First, if a player is in more games then they get more opportunities (ABs) to score runners. Second, better players are more likely to get an RBI at any given AB, and better players tend to play more games. Regarding salary, which is expressed in millions of US Dollars, it appears that darker-colored dots (which indicate a lower salary) lie closer to the origin, but all types of dots are spread out across this scatterplot.

This graph shows that RBIs are very strongly positively correlated with hits. This is unsurprising, as the more hits a player has, the more likely it is that they got a hit while a runner was in scoring position, thus getting an RBI. Again regarding salary, which is expressed in millions of US Dollars, it appears that more lightly-colored dots (which indicate a lower salary) lie closer to the origin.

Plotly Graphs for Further Investigation and Interactivity

Plotly allows the user to hover over different parts of the graph and view metrics at each different observation.

This scatterplot shows that typically as the number of games increases, so does the number of hits. This is valuable since it supports the common theme that generally the more games played, the higher the other stats will likely be. While the different colors of log(salary) dots are scattered throughout the scatterplot, darker dots (lower salary) tend to be concentrated more toward the origin.

From this graph, we can see that there is not a significant difference in the distribution of salary between players who bat frequently and those who bat infrequently. The log(salary) variable is clustered around a value just greater than 13 for both distributions, but this observation is slightly more pronounced for those players who bat frequently. The log(salary) variable is also slightly clustered around a value of 16 for those who bat infrequently, but this observation is hardly noticeable for those who bat frequently.

This visual is rather similar to the previous one, which could be because both the games and at bats variables reflect how frequently one plays. Like the graph before, the median log(salary) for those who fall into the category "Plays Frequently" is slightly higher than the median log(salary) for players who fall into the category "Plays Infrequently."

Methodology/Empirical Model

In this section we will utilize the variables we have selected based on our research and correlation matrices. We will construct several regression models like the one shown below:

$$ log(salary) = \beta_0 + \beta_1*RBI + \beta_2*BA + \beta_3*SB + \beta_4*H + u $$

The following regression motivates this section and the results section which are to be completed for the final portion of this investigation. The model which includes all variables of interest is shown below:

*stargazer was not used for this OLS output since the number of variables resulted in a distracting display due to the length of the stargazer output. Stargazer will be used later in the investigation.

From the OLS summary table we can see that there are very few variables that are significant predictors of salary. Surprisingly, there is very little positive relationship between salary and stats such as homeruns, RBIs, and extra base hits (doubles and triples). The few variables that appear to have a significant effect on salary are games, at bats, and triples (though, surprisingly, the relationship for triples is negative). To further evaluate the selected variables, we have found the confidence intervals as shown below:

Notice that almost all of the confidence intervals constructed for each variable contain 0. When a confidence interval contains zero, we cannot reject the null hypothesis that the corresponding coefficient is zero at the matching significance level, so we are uncertain whether that variable has any effect on our dependent variable. According to the table, there are only a few variables with confidence intervals that don't contain zero:

Below we have computed some t stats and obtained p values for variables with confidence intervals that contain zero.

Not surprisingly, almost none of these variables have a p value that is significant at common values of alpha, which supports the evidence from the confidence intervals. To determine the significance of the above variables, we ran a t-test for each with the null hypothesis $H_0: \beta_i=0$. As we can see, none of these variables are significant predictors of salary at the alpha value of 0.05. Oddly, SO has the lowest p-value at approximately 0.09, which is significant at the alpha value of 0.1. Therefore, at the 0.05 level we fail to reject the null hypothesis that the coefficients on these variables are zero.
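The link between the confidence intervals and the t-tests can be sketched with NumPy alone. This is an illustration under stated assumptions, not the notebook's OLS output: the data are simulated, the regressor named `junk` is hypothetical, and the p values and intervals use a normal approximation to the t distribution, which is reasonable only for samples of a few hundred observations or more.

```python
from statistics import NormalDist

import numpy as np

# Simulated data (an assumption for illustration): log salary depends on games
# played, while "junk" is a regressor with no true effect.
rng = np.random.default_rng(0)
n = 400
games = rng.integers(6, 163, size=n).astype(float)
junk = rng.normal(size=n)
log_salary = 13.0 + 0.01 * games + rng.normal(size=n)

# Design matrix with an intercept column, then ordinary least squares.
X = np.column_stack([np.ones(n), games, junk])
beta, *_ = np.linalg.lstsq(X, log_salary, rcond=None)

# Classical (homoskedastic) standard errors.
resid = log_salary - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# t statistics for H0: beta_i = 0, with normal-approximation p values and 95% intervals.
std_normal = NormalDist()
t_stats = beta / se
p_values = np.array([2 * (1 - std_normal.cdf(abs(float(t)))) for t in t_stats])
z = std_normal.inv_cdf(0.975)
ci_low, ci_high = beta - z * se, beta + z * se
contains_zero = (ci_low <= 0) & (ci_high >= 0)

for name, b, p, zero in zip(["const", "games", "junk"], beta, p_values, contains_zero):
    print(f"{name:>6}: coef={b: .4f}  p={p:.3f}  CI contains 0: {zero}")
```

The two checks are the same test viewed from two sides: a 95% interval contains zero exactly when the two-sided p value exceeds 0.05.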

Below we have computed the t stats and p values for the variables with confidence intervals that don't contain zero.

The four stats above all have p values small enough to reject the null hypothesis $H_0: \beta_i=0$ at common values of alpha: G, for example, has a p value far below 0.05. It would make sense that these stats would affect salary in a noticeable way: the more games a player plays, the more likely they're a starter and a contributing member of the team -- this is probably true for at bats as well. Furthermore, a high number of bases on balls is indicative of an experienced, patient hitter, and triples are rare and may only come from the most experienced batters.

These findings lead us to construct another regression model which includes these four variables:

$$ log(salary) = \beta_0 + \beta_1*G + \beta_2*AB + \beta_3*triple + \beta_4*BB + u $$

Removing the variables that proved to be insignificant, we are left with the above regression model for log(salary). As we can see, G and AB appear to have little effect on the variance of salary; however, triples account for almost 17% of the variance in log(salary).

Lastly, we ran an F-test to determine the joint significance of the variables in the final model, with the null hypothesis $H_0: \beta_1=\beta_2=\beta_3=\beta_4=0$. Since the F value is larger than the critical value, we reject the null hypothesis and conclude that the variables are jointly significant.
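For reference, a standard textbook form of the F statistic for this kind of joint test is sketched below. The symbols are assumptions introduced here, not taken from the regression output: $SSR_r$ and $SSR_{ur}$ are the sums of squared residuals from the restricted (intercept-only) and unrestricted models, $q$ is the number of restrictions, $n$ is the number of observations, and $k$ is the number of regressors.

$$ F = \frac{(SSR_r - SSR_{ur})/q}{SSR_{ur}/(n - k - 1)} $$

With the four regressors in the model above, $q = k = 4$, and an F value that is large relative to the critical value of the $F_{q,\,n-k-1}$ distribution is evidence against the null hypothesis that all four coefficients are zero.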

Examining Batting Average and On Base Percentage

As outlined in the motivation section, these two variables are of high interest for explaining the dependent variable salary, so they will be examined in this individual section using stargazer.

$$ log(salary) = \beta_0 + \beta_1*BA + \beta_2*OBP + u $$

It makes sense to include OBP in the regression because it's an effective measure of a baseball player's expertise when batting: they know when to swing and when not to. It's not just related to hits, as the article in the introduction section mentioned. This is under the assumption that a higher on base percentage is more valuable to the player and therefore the baseball team -- it would make sense for the variable to have a positive effect on salary.

From the table we can see that OBP has a larger coefficient than average (a coefficient of around 2 for OBP vs a coefficient of around 1 for average). From this we see that BA affects salary more than OBP when alone in the model. Interestingly enough, when we put them in the same model, OBP affects salary more.
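Because the dependent variable is log(salary), each coefficient is read as an approximate proportional change in salary. As a worked sketch, taking the rounded coefficient of about 2 on OBP from the discussion above as given, a 10-point (0.010) increase in OBP corresponds to:

$$ \%\Delta salary \approx 100 \cdot \beta_2 \cdot \Delta OBP = 100 \cdot 2 \cdot 0.010 = 2\% $$

The exact figure is $100\,(e^{0.02} - 1) \approx 2.02\%$, so the approximation is close for changes of this size.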

Conclusion

In this study, our primary objective was to determine which batting stats were most strongly correlated with an increase in salary for baseball players. To achieve this, we first singled out the year 2015 to obtain cross-sectional data and then filtered out players with fewer than 25 at bats to avoid any outliers (such as a player who was being paid 20 million dollars a year but got injured after his second game).

The three stats that we were most interested in and expected to be most strongly positively correlated with salary were batting average (BA), on base percentage (OBP), and homeruns (HR). To our surprise, all three of these variables were only slightly correlated with salary, and none of them were found to be significant predictors of salary at the 0.05 level. Instead, what we found was that the only two variables that were significantly correlated with salary were games played and at bats. It makes sense that games played and at-bats are positively correlated with salary, as better players, who get paid more, are likely to be starting players for their team. Despite this, we are still unsure why these two stats were the only significant predictors of salary, as the same logic would apply to stats such as homeruns.

To improve this investigation it may have been useful to include position as a variable in the analysis; however, the data set we used did not have the variable easily accessible. I planned on joining another dataset which included players' positions, but I ran into a major problem: players often had multiple positions listed, which resulted in duplicated observations when joining. It would be bad statistical practice to omit certain position duplicates without sound evidence, so the issue was avoided altogether. I recently learned that wooldridge offers an MLB dataset with salaries and many other stats similar to those within this dataset. I would be curious to see if that dataset more cleanly integrates a player's position since I believe position is inseparably correlated with salary after seeing the results of this investigation, or lack thereof.

One of the biggest caveats of this study is that we only chose the year 2015 to investigate. It is entirely possible that if we were to choose a different year, or include multiple other years in addition to 2015, we would find different results (although this might not be true). Additionally, we filtered on players with greater than 25 at bats. If we chose to include all players, or we chose to include only players with greater than 100 at bats, it is very likely that we would have obtained different results. These issues could easily be addressed by running the same study as above multiple times while including these different specifications. For example, we could run the same study but with players from the year 1995, or with players from the 1970s. Additionally, we could run the same study but with all players regardless of the number of at bats, or with only players who have greater than 100 at bats.

Sources

Salaries article: https://www.businessinsider.com/highest-paid-mlb-players-2015-5

Baseball statistical archive: https://www.seanlahman.com/baseball-archive/statistics/

Variable Explanation by CRAN: https://rdrr.io/cran/Lahman/man/Batting.html

Bleacherreport article: https://bleacherreport.com/articles/40132-on-base-percentage-vs-batting-average-which-is-more-important

Beyondtheboxscore article: https://www.beyondtheboxscore.com/2016/9/9/12842846/home-runs-wins-correlation-2016

 


Python Programming Laboratory

Scott Masterson, Alex McMillen, Matthew Burke

Spring 2022